I. INTRODUCTION



Considerable effort has been spent by the Naval Personnel
Research and Development Center (NPRDC) to develop a model
that would enable the Navy to forecast future states of the
enlisted force structure. This model, entitled FAST (see [2],
[4] and [5]), is a highly comprehensive model that involves
acquisitions, losses, and advancements, as well as a large number
of subcategories of these variables of the Navy personnel force.

FAST has been used successfully in the past few years as a
long-range planning tool as well as for researching the behavior
of the enlisted force. Because of the complexity of the model, its
operation requires a large amount of data processing and computer
time.

In an attempt to increase the flexibility of FAST, this research 
effort concentrated on a single variable of the personnel force: 
losses. Since forecasting future losses is one of the major tasks 
of FAST, it was considered important to attempt to simplify that 
single aspect of FAST. 

II. THE FORECASTING PROBLEM 

The enlisted Navy force is organized and managed along the 
lines of ratings, that is, job skills within the Navy. Consequently, 
the job of forecasting losses must be done for each rating indi- 
vidually. In addition, losses categorized by length of service 
and pay grade simultaneously are preferred, so that the effects 
of projected losses on the force structure can be forecast as well. 

When all of the above variables are considered simultaneously,
the population of individuals being considered is greatly
diminished. For example, while the number of E-5's with 15
years of service may be several hundred, the number of Electronics
Technicians who are E-5 with 15 years of service is slight.

This problem of sparse data makes the task of accurate fore- 
casting difficult. Procedures for forecasting are all predicated 
on some statistical stability in people's actions. This stability 
comes about with large populations of individuals whose reactions 
are similar. With the small populations that are inherent in 
sparse data, the consequent lack of statistical stability makes 
reliable forecasting difficult at best. 

To help overcome the problems caused by sparse data, the 
populations can be recombined to form fewer groups of larger 
sizes. A natural choice for this combination, or pooling of data, 
is along the lines of ratings. That is, if ratings which exhibit 
similar loss behavior statistically are identified and grouped, 
or clustered together, the resulting clusters can be used in place 
of ratings to gain some statistical stability. The pooling of data
in clusters of ratings is sought only to improve the estimates of 
loss characteristics and of certain parameters in statistical 
models. The forecasting of losses for each rating can still be 
accomplished. This then is one reason for finding clusters of 
Navy ratings which exhibit similar loss behavior. Other applica- 
tions of the clustering would be to identify groups of ratings 
to which common policies regarding loss and retention might be 
applied. The following sections of this report describe approaches 
to identifying the clusters and a procedure for estimating their
possible effectiveness in improving forecasts. 

For the purpose of our analysis, losses were defined to include 
losses for all reasons, from all pay grades and length of service 
cells. Actual prediction of losses is more complex, involving 
many variables, as described in [2] and [4].

III. HIERARCHICAL CLUSTERING

A common technique for clustering is the hierarchical clustering
method. We will give a brief description of the method here. Ref.
[1] provides more details.

The hierarchical clustering approach groups objects, in our
case Navy ratings, into several sets of clusters, each one contained
in the previous one. Figure 1 shows a small example of the result
for 5 objects.

The tree structure in Figure 1, called a dendrogram, indicates 
how this procedure formed the groups of clusters. The order shown 
here is not unlike the groupings which occur in biological taxonomy, 
where all life forms are grouped, first into species, then into 
genera, then into families, and so on. This method may appropriately 
be called numerical taxonomy. 

The dendrogram in Figure 1 shows the 5 individual objects being
grouped into two groups, objects 1 and 2, and objects 3, 4, and 5.
This is the first grouping beyond the base level of 5 singleton
groups. A coarser grouping brings all 5 objects into a single
set. The distance scale provides a measure of selectivity in forming
the groups. If the "distance" allowed between objects to be clustered
together is 10, then just two groups are formed. This criterion
must be increased to 90 before the first two groups become one,
thus indicating that the clustering into two groups is probably
natural, while a clustering into one group is probably not. The
interpretation of what groupings are natural is somewhat subjective
if based only on the dendrogram. As described later, the clusters
in this application are evaluated apart from the dendrogram.

Figure 1: A Dendrogram for Hierarchical Clustering

In order to produce a dendrogram, a "distance" between each 
pair of objects must be specified. In this application, the objects 
are enlisted Navy ratings, and the distance between two ratings should 
measure the proximity of their loss behavior. The distance function 
chosen for this purpose is 



$$ d(k,m) = \left[ \sum_{i=1}^{7} w_i \, (\ell_{i,k} - \ell_{i,m})^2 \right]^{1/2} $$

where

    d(k,m)      = distance between ratings k and m
    \ell_{i,k}  = loss rate from rating k in year i
    \ell_{i,m}  = loss rate from rating m in year i
    w_i         = weight given to year i, set by the parameter p
                  (the printed form of the weight is not legible)
    p is a parameter, 0 < p < 1

and years are indexed with 1966 for i = 1, 1967 for i = 2, ..., 1972
for i = 7. These years are used simply because they comprise
the data base for the research project. The parameter p is
included to weight the recent years more heavily. Thus, two ratings
are judged "close" by this criterion if their loss rates are close,
especially in recent years. The specific value for the parameter
p remains to be determined by the methods discussed in a later
section.
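
As a minimal sketch of the distance computation, the Python fragment below evaluates a weighted Euclidean distance between the yearly loss rates of two ratings. The helper geometric_weights and its form p**(7 - i) are assumptions made only for illustration (chosen by analogy with the a**(7 - i) weights of the forecasting scheme in Section V); the weights are passed in explicitly so that any other weighting can be substituted. The sample loss rates are the Personnelman and Fire Control Technicians rows of Table 2.

```python
import math


def geometric_weights(p, n_years=7):
    """Hypothetical year weights p**(n_years - i), i = 1..n_years (an assumption)."""
    return [p ** (n_years - i) for i in range(1, n_years + 1)]


def weighted_distance(rates_k, rates_m, weights):
    """Weighted Euclidean distance between two ratings' yearly loss rates."""
    if not (len(rates_k) == len(rates_m) == len(weights)):
        raise ValueError("all sequences must cover the same years")
    total = sum(w * (a - b) ** 2 for w, a, b in zip(weights, rates_k, rates_m))
    return math.sqrt(total)


# Loss rates in percent for 1966-72, from the Table 2 rows for PN and FT.
pn = [20.31, 25.19, 25.91, 30.20, 31.86, 25.61, 22.19]
ft = [19.12, 26.18, 22.26, 25.25, 27.72, 18.55, 16.01]

print(weighted_distance(pn, ft, geometric_weights(0.5)))
```

With all weights equal to 1 the function reduces to the ordinary Euclidean distance between the two seven-year loss-rate vectors.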

Once a distance between ratings has been defined, it is
necessary to define a distance function between subsets of
ratings, because the hierarchical clustering algorithm cannot be
defined without one. While many definitions of distance
between subsets are possible, two were investigated and one
finally used. The "maximum metric" is defined to be the maximum
of all distances between pairs of objects, one chosen from each
subset. If $C_1$ and $C_2$ are two subsets of ratings, we have

$$ d_{\max}(C_1, C_2) = \max \{ d(k,m) \mid k \in C_1, \; m \in C_2 \}. $$

The "minimum metric" is analogously defined, with min replacing
max in the above definition.

Under the maximum metric, two subsets of ratings are close 
only if all ratings are close to each other. The minimum metric 
only requires that two ratings in the subsets be close, while 
others may be distant, for the subsets to be close. These two 
definitions generate strikingly different dendrogram shapes as 
illustrated later. 
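
The sketch below illustrates how the two subset metrics drive a hierarchical clustering. It assumes the standard agglomerative procedure (repeatedly merge the two closest clusters), which the text does not spell out; the function names and the five sample positions are hypothetical and serve only to show how the maximum and minimum metrics produce different merge distances.

```python
def set_distance(c1, c2, d, linkage=max):
    """Distance between two clusters: max (or min) of d over all cross pairs."""
    return linkage(d[k][m] for k in c1 for m in c2)


def agglomerate(d, linkage=max):
    """Merge the two closest clusters until a single cluster remains.

    d is a symmetric matrix of pairwise distances. Returns a list of
    (merge_distance, clusters_after_merge) steps, i.e. the dendrogram levels.
    """
    clusters = [frozenset([i]) for i in range(len(d))]
    history = []
    while len(clusters) > 1:
        pairs = [(set_distance(a, b, d, linkage), i, j)
                 for i, a in enumerate(clusters)
                 for j, b in enumerate(clusters) if i < j]
        dist, i, j = min(pairs)
        merged = clusters[i] | clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(merged)
        history.append((dist, [sorted(c) for c in clusters]))
    return history


# Hypothetical objects placed on a line; distance = absolute difference.
positions = [0, 4, 50, 53, 58]
d = [[abs(a - b) for b in positions] for a in positions]

for name, linkage in (("maximum", max), ("minimum", min)):
    print(name, [dist for dist, _ in agglomerate(d, linkage)])
```

Both metrics recover the same two natural groups here, but the final merge happens at a much larger distance under the maximum metric than under the minimum metric.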

IV. CLUSTERING BY CORRELATION 

1. Correlating Population Size and Corresponding Loss Rate. 

Examination of the data on population sizes and loss 
rates in various ratings over the years 1966-72 suggested that 
ratings may be grouped on the basis of whether their population 
size correlates positively or negatively (and to what extent) 
with their corresponding loss rate. 

For example, it appears that some ratings, such as Quartermaster
(200 QM), have their loss rate increase (or decrease)
together with their population size over the years 1966-72. At
the same time, other ratings, such as Construction Recruit (6000
CR), have their population size and loss rate tend (in most cases)
in opposite directions from one year to the next.

The correlation between population size and loss rate was 
studied for all ratings and "All Navy" over the seven data points, 
provided by the years 1966-72. In addition to measuring the 
correlation directly for these data points, rank correlation was 
also used, since the actual magnitude of the changes in population 
size seemed both unimportant and incongruous when compared to changes 
in the loss rate. 

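Two different rank correlation coefficients were used. These
(see [1]) are defined below in terms of the rankings $P_1, \ldots, P_7$
of the seven population sizes, over the years 1966-72, of a given
rating and the rankings $\ell_1, \ldots, \ell_7$ of the seven
corresponding loss rates.

(i) Spearman's Rho:

Let $D_i = P_i - \ell_i$, $i = 1, \ldots, 7$, be the difference in
the rankings. Then

$$ \rho = 1 - \frac{6}{7(7^2 - 1)} \sum_{i=1}^{7} D_i^2 = 1 - \frac{1}{56} \sum_{i=1}^{7} D_i^2 . $$

(ii) Kendall's Tau:

For each pair of years $i < j$, let

$$ s_{ij} = \begin{cases} +1 & \text{if } (P_i - P_j)(\ell_i - \ell_j) > 0, \\ -1 & \text{if } (P_i - P_j)(\ell_i - \ell_j) < 0. \end{cases} $$

Then

$$ \tau = \frac{1}{21} \sum_{i<j} s_{ij} , $$

where $21 = 7(7-1)/2$ is the number of pairs of years.
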
(iii) Ordinary Correlation Coefficient:

If $P_i$ and $\ell_i$ denote the actual magnitudes of the population
sizes and corresponding loss rates, respectively, of a rating over the
years 1966-72, the correlation coefficient is defined as

$$ r = \frac{\sum_{i=1}^{7} (P_i - \bar{P})(\ell_i - \bar{\ell})}{\left[ \sum_{i=1}^{7} (P_i - \bar{P})^2 \, \sum_{i=1}^{7} (\ell_i - \bar{\ell})^2 \right]^{1/2}} $$

where

$$ \bar{P} = \frac{1}{7} \sum_{i=1}^{7} P_i \quad \text{and} \quad \bar{\ell} = \frac{1}{7} \sum_{i=1}^{7} \ell_i . $$


Each of these correlation coefficients provides a method of
clustering ratings. Kendall's Tau seemed, perhaps, the most
accommodating in providing clusters that separate in a somewhat
natural way. Thus, three clusters may be formed on the basis
of the values of Kendall's Tau.

Table 1 shows a histogram of loss rates for ratings against their
$\tau$-values. Each of the three clusters may be broken into further
subclusters in various ways based on the loss rates of the ratings
in each cluster. Such methods are suggested in the next subsection.
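
For readers who want to reproduce the rank correlations, the following sketch computes Spearman's Rho and Kendall's Tau in their standard textbook forms, assuming no tied values. The population figures are hypothetical and are not taken from the data base; the loss rates are the Personnelman row of Table 2.

```python
def ranks(values):
    """Rank values 1..n from smallest to largest (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman_rho(x, y):
    """Spearman's Rho: 1 - 6 * sum(D_i^2) / (n (n^2 - 1))."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


def kendall_tau(x, y):
    """Kendall's Tau: (concordant - discordant pairs) / number of pairs."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)
    return s / (n * (n - 1) / 2)


# Hypothetical population sizes (thousands) for 1966-72, illustration only.
population = [590, 660, 665, 675, 685, 605, 540]
# Personnelman loss rates in percent for 1966-72 (Table 2).
loss_rate = [20.31, 25.19, 25.91, 30.20, 31.86, 25.61, 22.19]

print(spearman_rho(population, loss_rate))
print(kendall_tau(population, loss_rate))
```

Because Kendall's Tau depends only on the signs of the pairwise differences, applying it to the raw values or to their rankings gives the same result.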

2. Correlating Loss Rates with All Navy Population Size. 

If the above procedure for clustering ratings is to be
useful, it should provide a procedure for forecasting future loss
rates through the use of clusters. Since the above clusters are
obtained by correlating loss rates of ratings with the corresponding
population sizes, one would need reasonably accurate estimates
of future population sizes in each rating in order to forecast
the corresponding loss rates (and then actual losses). It seems
unlikely that such estimates would be available for each rating,
and certainly not several years in advance. If good estimates
of population sizes are available for future years at all,
they will be for "All Navy" only. For that reason, it appears
desirable to correlate loss rates of ratings with "All Navy"
population size. The three correlation coefficients defined above
are again relevant, with the only change that $P_1, \ldots, P_7$ now
denote the "All Navy" population sizes, or their rankings, over the
years 1966-72. Table 2 presents the lists of ratings in three clusters
formed on the basis of Kendall's Tau. The three clusters are
designated A, B, and C.

Table 2 (fragment): loss rates in percent, by year

Code  Rating                             1966   1967   1968   1969   1970         1971   1972
1800  PN  Personnelman                   20.31  25.19  25.91  30.20  31.86        25.61  22.19
4500  DC  Damage Control                 20.41  28.94  24.61  32.27  91.86 [sic]  29.09  17.69
 400  ST  Sonar Technicians (3)          17.01  23.52  20.93  24.32  27.75        15.70  18.16
1000  ET  Electronics Technicians (3)    18.39  23.79  29.01  24.21  25.60        13.97  13.69
 800  FT  Fire Control Technicians (4)   19.12  26.18  22.26  25.25  27.72        18.55  16.01

Kendall's Tau values printed alongside these rows, in order:
0.43, 0.43, 0.52, (illegible), 0.71, 0.90.

All three of these clusters may be considered too big, and in any
case the loss rates of ratings within each cluster vary widely. Since
clusters are envisioned as groups of ratings of like loss rates,
it is necessary to break each of the above clusters into further
subclusters. (The same remark applies when clustering is accomplished
by correlating each loss rate with its own population size.)

Further subclusters may be formed by selecting one of
several candidate statistics, such as:

(i) The mean loss rate of ratings over the seven years;

(ii) The median loss rate of ratings over the seven years;

(iii) The mean or median loss rate of ratings over the last
three years only;

(iv) The loss rate of ratings of the last year only.

For demonstration purposes, one of these statistics, namely
the median loss rate of ratings over the three years 1970-72, was
selected. Figure 2 shows each of the ratings (and "All Navy")
represented by its median loss rate over the years 1970-72. The
three clusters referred to above are separated in the graph. The
graph itself suggests further subclusters based on the size of the
loss rates. For example, Cluster A may be grouped in four
subclusters based on the median loss rate $\ell^{(m)}$ of (iii):

(a) Ratings in Cluster A with $0 \le \ell^{(m)} \le 20\%$ $(A_1)$

(b) Ratings in Cluster A with $20\% < \ell^{(m)} \le 27\%$ $(A_2)$

(c) Ratings in Cluster A with $27\% < \ell^{(m)} \le 33\%$ $(A_3)$

(d) Ratings in Cluster A with $33\% < \ell^{(m)} \le 100\%$ $(A_4)$

Similar subclusters may be formed within Clusters B and C. These
are indicated in Figure 2 by vertical lines drawn as boundaries
between neighboring subclusters.
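
The subclustering rule can be sketched as a simple binning of each rating's median loss rate over 1970-72. The boundary values 20, 27, 33 and 100 percent are those listed for Cluster A; treating each upper boundary as inclusive is an assumption, and the function name is hypothetical. The sample rates come from Table 2 and are used only to exercise the function, without assuming which cluster those ratings belong to.

```python
from statistics import median

# Upper boundaries (percent) of subclusters 1..4, as listed for Cluster A.
BOUNDS = [20.0, 27.0, 33.0, 100.0]


def subcluster(last_three_rates, bounds=BOUNDS):
    """Return the 1-based subcluster index for a rating's median loss rate."""
    m = median(last_three_rates)
    for index, upper in enumerate(bounds, start=1):
        if m <= upper:          # upper boundary assumed inclusive
            return index
    raise ValueError("median loss rate exceeds the last boundary")


# Loss rates for 1970, 1971, 1972 (percent), from Table 2.
print(subcluster([31.86, 25.61, 22.19]))  # Personnelman
print(subcluster([25.60, 13.97, 13.69]))  # Electronics Technicians
```

Using the median rather than the mean keeps a single unusual year from moving a rating across a boundary.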

A shortcoming of this method is that it is quite ad hoc in
selecting the boundaries between clusters and subclusters. Also,
since the clusters are formed at the start on the basis of the
correlation coefficients, ratings with similar loss rates may be found
in separate clusters. Thus, for example, many ratings in Cluster C
have loss rates closer to those of some ratings in Cluster B than to
those of ratings in their own subcluster. This may be regarded as a
disadvantage if one considers it an overriding necessity to
cluster by like loss rates. On the other hand, ratings with similar
loss rates may be placed in different clusters because these loss
rates are tending in opposite directions over the years. It
may be desirable in such cases to group such ratings separately
despite their like loss rates.

Because of the ad hoc nature of this clustering method, it
was not used in the rest of this research effort.

V. EVALUATION OF HIERARCHICAL CLUSTERS 

The methods described above lead to various clusterings or 
partitions of the enlisted ratings. In this section, we describe 
how any such partition was evaluated. 

Let the set of enlisted ratings be designated S, where

S = {1, 2, ..., N}

and N is the number of ratings being considered. In our case,
N = 71 ratings. The total number of individual ratings is about
130; however, some of the 130 are service ratings which support a
general rating. In these instances, several service ratings
contain men specializing in a similar area, usually at the middle
pay grades such as E4 to E6 or E7. A single general rating associated
with these service ratings contains all men in the common area at the
pay grades beyond those of the service ratings. The general rating
then contains the foremen and line managers for the men in the
service ratings. When this occurred, all the service ratings and
their associated general rating were combined into a pseudo rating
for the analysis. This avoided having ratings with only a few
pay grades. The common technical skill areas of these ratings
made their prior combination seem natural, and reduced the number
of ratings analyzed to 71. A few recent ratings with no history
in our data base were left out, as they were a special case and
quite few in number. Table 3 shows the definition of
ratings used for the study, with the actual rating codes included
in each of our ratings.

With the ratings as defined above, a partition or clustering
of S is a set of subsets $C_k$ of S for which

$$ C_k \cap C_j = \emptyset \quad \text{if } k \ne j, \qquad \bigcup_k C_k = S. $$

If there are m subsets (k = 1, ..., m), the partition is said
to be of size m. Many partitions, suggested primarily by the
hierarchical clustering method, were evaluated by a method
described below.
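
The two partition conditions translate directly into a short check. This is only an illustrative sketch; the function name is hypothetical, and ratings are represented by their index in S.

```python
def is_partition(clusters, n):
    """True if the clusters are pairwise disjoint and their union is {1, ..., n}."""
    seen = set()
    for cluster in clusters:
        members = set(cluster)
        if seen & members:      # an overlap violates C_k ∩ C_j = ∅
            return False
        seen |= members
    return seen == set(range(1, n + 1))


print(is_partition([{1, 2}, {3, 4, 5}], 5))   # disjoint and exhaustive
print(is_partition([{1, 2}, {2, 3}], 3))      # rating 2 appears twice
```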

TABLE 3: RATINGS USED IN THE STUDY

Index in S  Name                                  Rating Codes
 1          Boatswains Mate                       100
 2          Quartermaster                         200
 3          Signalman                             250
 4          Operations Specialist                 300
 5          Sonar Technicians                     400, 401, 404
 6          Torpedomans Mate                      500
 7          Gunners Mates                         600, 601, 604
 8          Gunners Mate Technician               602
 9          Fire Control Technicians              800, 801, 802, 803
10          Missile Technician                    810
11          Mineman                               900
12          Electronics Technicians               1000, 1001, 1002
13          Data Systems Technician               1010
14          Instrumentman                         1100
15          Opticalman                            1200
16          Radioman                              1500
17          Communication Technicians             1600, 1611, 1622, 1633, 1644, 1655, 1666
18          Yeoman                                1700
19          Legalman                              1701
20          Personnelman                          1800
21          Data Processing Technician            1900
22          Storekeeper                           2000
23          Disbursing Clerk                      2100
24          Commissaryman                         2290
25          Ships Serviceman                      2490
26          Journalist                            2600
27          Postal Clerk                          2700
28          Lithographer                          3100
29          Illustrator Draftsman                 3200
30          Musician                              3300
31          Seaman Recruit                        3600
32          Machinists Mate                       3700
33          Engineman                             3800
34          Machinery Repairman                   3900
35          Boilerman                             4000, 4020
36          Electricians Mate                     4100
37          Interior Communication Elec.          4200
38          Hull Technicians                      4300, 4410, 4411, 4412
39          Damage Control                        4500
40          Patternmaker                          4600
41          Moulder                               4700
42          Fireman Recruit                       5000
43          Engineering Aid                       5100, 5101, 5102
44          Construction Electrician              5300, -1, -2, -3, -4, -5, -6
45          Equipment Operator                    5410, 5411, 5412
46          Construction Mechanic                 5500, 5503, 5504
47          Builder                               5600, 5601, 5602, 5503
48          Steel Worker                          5700, 5703, 5704
49          Utilitiesman                          5800, 5801, 5802, 5803, 5804
50          Construction Recruit                  6000
51          Aviation Machinists Mate              6200, 6205, 6206
52          Aviation Electronics Technician       6300, 6304, 6306, 6307
53          Aviation Antisub Warfare Technician   6310
54          Aviation Ordnanceman                  6500
55          Aviation Fire Control Technician      6520, 6521, 6522
56          Air Controlman                        6600
57          Aviation Boatswains Mate              6700, 6704, 6705, 6706
58          Aviation Electricians Mate            6800
59          Aviation Structural Mechanic          6900, 6901, 6902, 6903
60          Aircrew Survival Equipman             7000
61          Aerographers Mate                     7100
62          Tradevman                             7200
63          Aviation Storekeeper                  7300
64          Aviation Maintenance Admin            7400
65          Aviation Support Equip. Technician    7500, 7501, 7502, 7503
66          Photographers Mate                    7600
67          Photographic Intelligence             7700
68          Airman Recruit                        7800
69          Hospital Corpsman                     8000
70          Dental Technician                     8300
71          Steward                               8500

This research investigation was conducted for the express
purpose of finding out whether the prediction of losses by forecasting
loss rates could be improved when data were pooled among ratings in
clusters, for some systematically well-defined clustering. The
approach was to forecast losses by a method approximating the one
actually used, and for which the clustering was originally intended.
The forecasting was done for the year 1973 (fiscal year), using
data in the years 1966-72. Then, the predicted losses were
compared to the actual losses in 1973. The prediction scheme
was not detailed enough to be used for actually forecasting
losses, and was only intended to be an evaluation of clustering.
If clustering is to improve the forecasting significantly (by any
means), then it should improve forecasting by the elementary
prediction scheme given below.

To evaluate any clustering or partition $C_k$, k = 1, ..., m,
the following approach was used. First, a projection of total
losses was made for each individual rating by projecting the loss
rate, i.e., the proportion of those on board at the year's start
who would be lost over the year. Let

    I_{i,j} = Inventory (of men) at the beginning of year i, in rating j,
    L_{i,j} = Losses during year i from rating j,

where the indices are i = 1, 2, ..., 7 for years 1966, 1967, ..., 1972
respectively, and j = 1, 2, ..., N.

The estimated loss rate in 1973 for rating j, denoted $\hat{\ell}_j$,
was obtained from a weighted average of the actual loss rates
in prior years. Specifically,

$$ \hat{\ell}_j = \frac{\sum_{i=1}^{7} a^{7-i} \, (L_{i,j} / I_{i,j})}{\sum_{i=1}^{7} a^{7-i}} $$

where a is a fixed weighting factor, 0 < a < 1. This estimated
loss rate was applied to the 1973 inventory $I_{8,j}$, yielding

$$ \hat{L}_j = \hat{\ell}_j \, I_{8,j} $$

as the estimated loss from rating j in 1973, using no clustering.
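
A minimal sketch of this unclustered prediction follows. The function names and the sample figures are hypothetical; only the weighting a**(7 - i) and the final multiplication by the 1973 inventory come from the text.

```python
def weighted_loss_rate(losses, inventories, a):
    """Weighted average of the yearly loss rates L_i / I_i, weights a**(n - i)."""
    n = len(losses)
    weights = [a ** (n - i) for i in range(1, n + 1)]
    rates = [loss / inv for loss, inv in zip(losses, inventories)]
    return sum(w * r for w, r in zip(weights, rates)) / sum(weights)


def forecast_losses(losses, inventories, next_inventory, a):
    """Estimated losses: weighted loss rate times the next year's inventory."""
    return weighted_loss_rate(losses, inventories, a) * next_inventory


# Hypothetical rating: losses and start-of-year inventories for 1966-72.
losses = [200, 250, 260, 300, 320, 255, 220]
inventories = [1000, 1000, 1000, 1000, 1000, 1000, 1000]

print(forecast_losses(losses, inventories, next_inventory=900, a=0.7))
```

Values of a close to 1 weight all seven years almost equally, while small values of a let the most recent years dominate the estimate.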

The same prediction scheme was used with clustering, and both
predictions were compared to the actual loss. To estimate the
loss rate with clusters, let $C_k$, k = 1, 2, ..., m, be the partition
of the ratings being considered. Then, pooling data over clusters
gives the formula for the common estimated loss rate of ratings
in cluster $C_k$:

$$ \tilde{\ell}_j = \frac{\sum_{i=1}^{7} a^{7-i} \left( \sum_{n \in C_k} L_{i,n} \Big/ \sum_{n \in C_k} I_{i,n} \right)}{\sum_{i=1}^{7} a^{7-i}} $$

for every $j \in C_k$. Then the estimated loss is

$$ \tilde{L}_j = \tilde{\ell}_j \, I_{8,j} , $$

where $I_{8,j}$ is the 1973 inventory of rating j.

It should be emphasized again that the prediction scheme used
here is not intended to be the best available for the data at hand.
Our purpose is only to evaluate the clustering, by comparing loss
predictions with and without clustering, using the same prediction
scheme in both instances.
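
The pooled estimate can be sketched in the same way. In the fragment below each year's losses and inventories are summed over the ratings of a cluster before the rate is formed, and the common rate is then applied to each rating's own 1973 inventory. The function names, the rating labels "X" and "Y", and all figures are hypothetical.

```python
def pooled_loss_rate(losses, inventories, cluster, a):
    """Common loss rate of a cluster: pooled yearly rates, weights a**(n - i)."""
    n_years = len(losses[cluster[0]])
    numerator = denominator = 0.0
    for i in range(n_years):
        weight = a ** (n_years - 1 - i)        # a**(7 - i) for 1-based year i
        pooled_losses = sum(losses[j][i] for j in cluster)
        pooled_inventory = sum(inventories[j][i] for j in cluster)
        numerator += weight * pooled_losses / pooled_inventory
        denominator += weight
    return numerator / denominator


def clustered_forecasts(losses, inventories, next_inventory, partition, a):
    """Estimated losses per rating, using the pooled rate of its cluster."""
    estimates = {}
    for cluster in partition:
        rate = pooled_loss_rate(losses, inventories, cluster, a)
        for j in cluster:
            estimates[j] = rate * next_inventory[j]
    return estimates


# Two hypothetical ratings observed for two years, pooled into one cluster.
losses = {"X": [10, 20], "Y": [30, 20]}
inventories = {"X": [100, 100], "Y": [100, 100]}
next_inventory = {"X": 50, "Y": 200}

print(clustered_forecasts(losses, inventories, next_inventory, [["X", "Y"]], a=1.0))
```

Pooling replaces each small rating's noisy yearly rate by the rate of a larger population, which is the statistical stability the clustering is meant to provide.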

VI. RESULTS OF CLUSTERING EXPERIMENT

1. Dendrograms.

Using the distance function defined in Chapter III, two
dendrograms were drawn for each of several values of the weighting
factor p. The two dendrograms correspond to the maximum and the
minimum metrics, respectively, between clusters as defined in
Chapter III. Figures 3 and 4 show examples of dendrograms with
the minimum and maximum metric respectively. An undesirable
feature of all dendrograms with the minimum metric is, as can be
seen in Figure 3, that separation into clusters does not occur
until sets are at a fairly close "distance" to each other. For
example, in Figure 3, although two clusters form at a "distance"
of 15.60, the next separation into (three) clusters occurs at
a "distance" of 3.12. Further separations occur at very short
intervals, at "distance" values 2.25, 1.692, 1.688, etc. This
makes it rather difficult to decide on the number of clusters to be
used. In contrast, Figure 4 shows a typical dendrogram with the
maximum metric. Here separations into clusters occur quite
gradually, at least until about ten clusters have formed. Separations
into two, three, four, etc., clusters occur at the "distance"
values 48.7, 29.9, 18.2, 14.3, 9.4, 7.6, etc. This provides more
justification for choosing, e.g., four clusters rather than three or
five. In choosing the appropriate number of clusters one must
consider that, while too many clusters would defeat the purpose
of clustering, too few clusters would result in a prediction method
that is too crude. For this reason the proper choice is probably
somewhere between three and ten clusters.
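
One simple way to make the comparison of the two dendrograms concrete is to look at the drop in "distance" between successive separations. The helper below is an illustrative device, not part of the study; the distance values are those quoted above.

```python
def drops(levels):
    """Decrease in distance between each separation and the next one."""
    return [a - b for a, b in zip(levels, levels[1:])]


# Distances at which two, three, four, ... clusters separate (from the text).
minimum_metric = [15.60, 3.12, 2.25, 1.692, 1.688]
maximum_metric = [48.7, 29.9, 18.2, 14.3, 9.4, 7.6]

print([round(x, 3) for x in drops(minimum_metric)])
print([round(x, 3) for x in drops(maximum_metric)])
```

Under the minimum metric nearly all of the drop occurs at the first separation, whereas under the maximum metric the drops are spread over several levels.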

Figure 3: Dendrogram with Minimum Metric

Figure 4: Dendrogram with Maximum Metric

2. Evaluation of Clustering.

In order to evaluate the effectiveness of clustering,
the prediction scheme described in Chapter V was devised. According
to this scheme, two estimates, $\hat{L}_j$ and $\tilde{L}_j$, were
computed as predictions without and with clustering, respectively,
for the losses in 1973 from rating j.



When the 1973 data on losses became available, the actual
losses, $L_j$, from Rating $j$ became known. Writing $\hat{L}_j$
for the losses predicted without clustering and $\tilde{L}_j$ for
the losses predicted with clustering, histograms were then
prepared for the following expressions:

(i)   $L_j - \hat{L}_j$ = error in prediction without clustering.

(ii)  $L_j - \tilde{L}_j$ = error in prediction with clustering.

(iii) $|L_j - \hat{L}_j| - |L_j - \tilde{L}_j|$ = difference in absolute
      errors without and with clustering.

(iv)  $(L_j - \hat{L}_j) \div L_j$ = normalized error in prediction
      without clustering.

(v)   $(L_j - \tilde{L}_j) \div L_j$ = normalized error in prediction
      with clustering.

(vi)  $(|L_j - \hat{L}_j| - |L_j - \tilde{L}_j|) \div L_j$ = difference in
      absolute normalized errors without and with clustering.
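
The six expressions can be computed for a single rating along the
lines of the following minimal Python sketch. The function and
field names are illustrative and do not come from the report.
Taking each error as actual minus forecast, and reporting the
normalized errors as undefined (None) when the actual losses are
zero, are both assumptions made here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ErrorMeasures:
    err_plain: float                 # (i)   error without clustering
    err_clustered: float             # (ii)  error with clustering
    abs_diff: float                  # (iii) difference in absolute errors
    norm_plain: Optional[float]      # (iv)  normalized error without clustering
    norm_clustered: Optional[float]  # (v)   normalized error with clustering
    norm_abs_diff: Optional[float]   # (vi)  difference in absolute normalized errors


def error_measures(actual: float, plain: float, clustered: float) -> ErrorMeasures:
    """Compute measures (i)-(vi) for one rating.

    actual    -- observed losses for the rating
    plain     -- losses forecast without clustering
    clustered -- losses forecast with clustering
    Sign convention (assumed): error = actual - forecast.
    """
    err_plain = actual - plain
    err_clustered = actual - clustered
    abs_diff = abs(err_plain) - abs(err_clustered)
    if actual == 0:
        # Normalizing by zero actual losses is undefined (assumed handling).
        return ErrorMeasures(err_plain, err_clustered, abs_diff, None, None, None)
    return ErrorMeasures(
        err_plain,
        err_clustered,
        abs_diff,
        err_plain / actual,
        err_clustered / actual,
        abs_diff / actual,
    )


# Hypothetical rating: 100 actual losses, forecasts of 90 and 95.
m = error_measures(100, 90, 95)
print(m.abs_diff, m.norm_abs_diff)
```

A positive value of measure (vi) means the clustered forecast was
closer to the actual losses than the unclustered one.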



The histograms were specifically examined for cases where the 
number of clusters was 3, 5, 7, 10, 15 and 20. 

The proper choice of value for p , the parameter used to
weight past years according to importance in the clustering
scheme, was also investigated. The value of p could be based
on empirical data considerations. For example, since 0 ≤ p ≤ 1 ,
the larger the value of p the more emphasis is placed on recent
years in the data base. In this study, however, the value of p
was chosen solely for its effect on clustering. Figure 5 shows
the level of the distance scale at which various numbers of
clusters form as the value of p is changed. The figure suggests
that in the vicinity of p = .1 , the points on the distance
scale where clusters form are better separated from each other
than is the case for other values of p .
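
The effect of p on the weighting of past years can be illustrated
with a minimal Python sketch. It assumes, purely for illustration,
that the weights follow the familiar exponential-smoothing form
p(1 - p)^k for the year k steps back; the weighting actually used
in the clustering scheme may differ. The function name is
hypothetical.

```python
from typing import List


def smoothing_weights(p: float, n_years: int) -> List[float]:
    """Return normalized weights, most recent year first.

    Assumed form: weight of the year k steps back is proportional
    to p * (1 - p) ** k, so a larger p favors recent years.
    """
    if not 0 < p <= 1:
        raise ValueError("p must satisfy 0 < p <= 1 for these weights")
    if n_years < 1:
        raise ValueError("n_years must be at least 1")
    raw = [p * (1 - p) ** k for k in range(n_years)]
    total = sum(raw)
    return [w / total for w in raw]


# Compare a small and a large p over three years of history.
print(smoothing_weights(0.1, 3))
print(smoothing_weights(0.9, 3))
```

Under this assumed form, p = .1 spreads the weight almost evenly
over the years, while p = .9 places nearly all of it on the most
recent year.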

The choice of value for a , the parameter that weights past
years according to their importance in the prediction scheme,
was not specifically investigated. It seemed natural to assume
that a = p . However, there could be convincing arguments for
choosing a different from p .

Among the types of histograms listed above, item (vi) was
the most relevant for the evaluation of clustering. The "difference
in absolute normalized errors without and with clustering" measures
the relative success of clustering in predicting future losses
versus the success of doing so by a comparable traditional
method. A large number of ratings having positive values for this
measure, especially large positive values, would indicate
significant success of clustering. A high percentage of ratings on
the negative side would suggest the opposite conclusion. The
actual results, however, were not conclusive either way. A typical
histogram is shown in Figure 6 for the case p = .1 and seven
clusters. The mean and median, as in most other such histograms,
are moderately negative, indicating that clustering was
slightly disadvantageous. As more and more clusters are used, the
histograms become concentrated at the origin. This is to be
expected, as using many clusters is practically equivalent to no
clustering at all. The choice of p did not seem to affect this
result a great deal, although the choice of p = .5 appeared
to be slightly more favorable to the clustering method. Figure 7
shows the histogram corresponding to the case p = .5 and seven
clusters.
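
Reading such a histogram amounts to summarizing measure (vi)
across the ratings. The following minimal Python sketch computes
the mean, the median and the share of ratings on the positive
side. The function name and the sample values are hypothetical
and are not results from the study.

```python
from statistics import mean, median
from typing import Dict, List


def summarize_measure(values: List[float]) -> Dict[str, float]:
    """Summarize measure (vi) over a set of ratings.

    Positive values favor clustering; negative values favor the
    forecast made without clustering.
    """
    if not values:
        raise ValueError("at least one rating is required")
    return {
        "mean": mean(values),
        "median": median(values),
        "share_positive": sum(v > 0 for v in values) / len(values),
    }


# Hypothetical values of measure (vi) for five ratings.
sample = [-0.10, -0.02, 0.00, 0.03, -0.01]
print(summarize_measure(sample))
```

A moderately negative mean and median, with most ratings near
zero, corresponds to the pattern described for Figure 6.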

The fact that the clustering method resulted in somewhat
bigger (absolute normalized) errors than the standard predicting
method does not render clustering totally worthless. Since the
two methods achieve a nearly identical measure of success, the
clustering method may still be advantageous because it shortens
the data processing procedures. This may be a more relevant
factor when the forecasting technique is not of the simple
variety described here, but is instead a more complex one such
as that used in FAST, described in [2], [4] and [5].

The histograms presented above do not show the size of the
errors made by either the clustering or the standard forecasting
method. The histogram presented in Figure 8 exhibits the size of
the normalized errors when forecasting by clustering (item (v)
above) for the case p = .1 and seven clusters. The horizontal
scale is in percentage. The figure shows that 58 of the 71
ratings had a less than 25% (positive or negative) error. For one
rating the error is shown as -100%. This is due to a rating
(Legalman) for which there were zero losses in 1973, while the
clustering method forecast 464. Since the zero loss in 1973 is
probably due to a data processing error, this large forecasting
error seems forgivable.
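
The count quoted from Figure 8 is the share of ratings whose
normalized error lies within a tolerance of zero; 58 of 71 is
about 82%. A minimal Python sketch of that tally follows. The
function name and the sample errors are hypothetical.

```python
from typing import List


def share_within(errors: List[float], tolerance: float = 0.25) -> float:
    """Fraction of normalized errors strictly smaller than
    `tolerance` in absolute value (0.25 means 25%)."""
    if not errors:
        raise ValueError("at least one error is required")
    return sum(abs(e) < tolerance for e in errors) / len(errors)


# Hypothetical normalized errors for four ratings.
print(share_within([0.10, -0.20, 0.30, -1.00]))
# Share reported for Figure 8: 58 of the 71 ratings.
print(round(58 / 71, 3))
```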

The histograms presented here are representative of the many
more cases which were tried. The results in every case were
essentially the same, namely one of indifference to clustering
the data for loss rate prediction. The number of subsets
was explored, as well as the choice of the parameters p and
a . The numerous dendrograms and histograms produced from these
experiments remain intact with the authors.

A by-product of this project is the identification of subsets
of ratings with common loss behavior. Such a grouping of ratings
would, for example, suggest guidelines for the application of
personnel policy to select groups of ratings. Other applications
could be explored as well by simply changing the criterion by
which ratings are judged to be close to each other. Groupings of
ratings could then be identified quickly and easily, based on
characteristics of behavior other than loss from the service.